The start date/time and duration of each trip can be used to understand how long a trip typically takes and when it is most likely to occur. The user information, such as user type, gender, and age, can be used to identify the main target customer groups. By summarizing the bike usage data for different groups of riders, we can see if there are any special patterns associated with a specific group.
For example, we might find that subscribers tend to take longer trips than customers, or that men tend to take more trips than women. We might also find that younger people tend to take shorter trips than older people.
This information can be used to improve the bike-sharing service by targeting different groups of riders with different marketing messages. For example, we might target subscribers with messages about longer trips, or we might target women with messages about safety.
Here are some specific questions you could ask:
What is the average trip duration for subscribers? For customers?
What are the most popular times of day for trips? On weekdays? On weekends?
What are the most popular start and end stations?
Do men or women take more trips?
Do younger or older people take more trips?
Do subscribers or customers take longer trips?

By asking these questions and exploring the data, you can learn more about how people are using the bike-sharing service and how it can be improved.
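As a rough illustration of how a few of these questions might be answered, the sketch below groups a tiny made-up table of trips with pandas. The rows are invented for the example and are not real results; only the column names (user_type, s_day, s_hour, duration_min) follow the cleaned dataset.

```python
import pandas as pd

# Made-up trips for illustration only; not taken from the real dataset.
trips = pd.DataFrame({
    "user_type": ["Subscriber", "Customer", "Subscriber", "Customer", "Subscriber"],
    "s_day": ["Monday", "Saturday", "Tuesday", "Sunday", "Saturday"],
    "s_hour": [8, 14, 17, 11, 8],
    "duration_min": [10.0, 30.0, 12.0, 20.0, 8.0],
})

# Average trip duration for each user type.
avg_duration = trips.groupby("user_type")["duration_min"].mean()

# Start hour with the largest number of trips.
busiest_hour = trips["s_hour"].value_counts().idxmax()

# Number of trips on weekdays versus weekends.
is_weekend = trips["s_day"].isin(["Saturday", "Sunday"])
trips_by_part_of_week = is_weekend.map({True: "weekend", False: "weekday"}).value_counts()

print(avg_duration)
print("Busiest start hour:", busiest_hour)
print(trips_by_part_of_week)
```

The same groupby pattern should extend to the other questions, for example grouping by member_gender or by an age bracket instead of user_type.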
Dataset Overview
The original combined dataset contains 183,412 individual trip records. There are 16 variables in this dataset, which can be divided into three major categories:
Trip duration: This category includes the duration_sec, start_time, and end_time variables. These variables give the length of each trip, the time the trip started, and the time the trip ended.
Station information: This category includes the start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, and end_station_longitude variables. These variables describe the start and end stations for each trip.
Member information (anonymized): This category includes the bike_id, user_type, member_birth_year, member_gender, and bike_share_for_all_trip variables. These variables give the bike used for each trip, the user type (subscriber or customer), the member's birth year, the member's gender, and whether the trip was made under the Bike Share for All program.

In addition to the original variables, the following derived features were created to assist with exploration and analysis:

Trip information: The duration_min variable was created by dividing the duration_sec variable by 60. The s_day and s_hour variables were created by extracting the day of the week and the hour of the day from the start_time variable.
Member: The member_age variable was created by calculating the age of the member from their birth year.
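The derived features described above could be computed along the following lines. This is a minimal sketch, not the cleaning code actually used for this dataset: the helper name add_derived_features, the sample rows, and the reference year of 2019 for the age calculation are all assumptions made for illustration.

```python
import pandas as pd

# Tiny made-up sample using the raw column names from the dataset.
raw = pd.DataFrame({
    "duration_sec": [600, 1500, 90],
    "start_time": ["2019-02-28 17:32:10", "2019-02-02 08:05:00", "2019-02-10 23:59:59"],
    "member_birth_year": [1984.0, 1995.0, None],
})

def add_derived_features(df, reference_year=2019):
    """Hypothetical helper: return a copy of df with the derived columns added.

    reference_year is an assumption; the source does not say which year
    was used to turn birth years into ages.
    """
    out = df.copy()
    start = pd.to_datetime(out["start_time"])
    out["duration_min"] = out["duration_sec"] / 60   # seconds to minutes
    out["s_day"] = start.dt.day_name()               # e.g. "Monday"
    out["s_hour"] = start.dt.hour                    # 0 to 23
    out["member_age"] = reference_year - out["member_birth_year"]
    return out

features = add_derived_features(raw)
print(features[["duration_min", "s_day", "s_hour", "member_age"]])
```

Rows with a missing birth year keep a missing member_age, so they would need to be dropped or handled separately before any age-based plots.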
# import all packages and set plots to be embedded inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline
sns.set()
# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")
# load in the dataset into a pandas dataframe
bike_19=pd.read_csv("cleand_fordgobike-tripdata.csv")
# plot the distribution of trip durations, limited to the first 70 minutes
fig = px.histogram(
    data_frame=bike_19,
    x='duration_min',
    title='Distribution of Trip Durations'
)
fig.update_layout(
    xaxis_title='Duration (minutes)',
    yaxis_title='Count',
    xaxis_range=[0, 70]
)
fig.show()
## helper that sets a plot's title, x label and y label
def plot_label(title, x_label, y_label):
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.xticks(rotation=45)
## Explore the trip distribution along a day
color_base=sns.color_palette()[0]
sns.countplot(data=bike_19,x='s_hour',color=color_base)
plot_label("Trip Start Hour Of The Day","Hour Of Day","Count")
plt.show();
## Explore the trip distribution along a week
days = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
sns.countplot(data=bike_19,x='s_day',color=color_base,order=days)
plot_label("Trip Start Day Of The Week ","Day Of The Week","Count")
## Compare trip duration by user type (trips of 60 minutes or less)
sns.violinplot(data=bike_19.query('duration_min<=60'), x='user_type', y='duration_min', color=color_base, inner='quartile')
plot_label("Trip Duration Distribution by User Type", 'User Type', 'Trip Duration in Minutes')
sns.barplot(data=bike_19,
            x="s_day",
            y="duration_min",
            order=days,
            color=color_base)
plot_label("Trip Duration by Day of Week", "Day of Week", 'Average Trip Duration (minutes)')
sns.barplot(data=bike_19, x="s_day", y="member_age", color=color_base, order=days)
plot_label("Average Member Age by Day of the Week", "Day Of The Week", "Age");
sns.countplot(data=bike_19,x='s_hour',hue='user_type')
plot_label("Hour of the day vs user_type","Hour Of Day","Count");
sns.countplot(data=bike_19,x='s_day',hue='user_type',order=days)
plot_label("Day of the Week vs User Type", "Day Of The Week", "Count");
!jupyter nbconvert Part_2_Ford_GoBike_System_Data.ipynb --to slides --post serve --no-input --no-prompt